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Abstract 

This paper studies quantum annealing (QA) 
for clustering, which can be seen as an exten- 
sion of simulated annealing (SA). We derive 
a QA algorithm for clustering and propose an 
annealing schedule, which is crucial in prac- 
tice. Experiments show the proposed QA al- 
gorithm finds better clustering assignments 
than SA. Furthermore, QA is as easy as SA 
to implement. 



1 Introduction 

Clustering is one of the most popular methods in data 
mining. Typically, clustering problems are formulated 
as optimization problems, which are solved by algo- 
rithms, for example the EM algorithm or convex relax- 
ation. However, clustering is typically NP-hard. The 
simulated annealing (SA) (Kirkpatrick et al., 1983) 
is a promising candidate. Geman and Geman (1984) 
proved SA was able to find the global optimum with 
a slow cooling schedule of temperature T. Although 
their schedule is in practice too slow for clustering of 
a large amount of data, it is well known that SA still 
finds a reasonably good solution even with a faster 
schedule than what Geman and Geman proposed. 

In statistical mechanics, quantum annealing (QA) has 
been proposed as a novel alternative to SA (Apolloni 
et al., 1989; Kadowaki and Nishimori, 1998; Santoro 
et al., 2002; Matsuda et al., 2009). QA adds another 
dimension, F, to SA for annealing, see Fig.l. Thus, it 
can be seen as an extension of SA. QA has succeeded 
in specific problems, e.g. the Ising model in statisti- 
cal mechanics, and it is still unclear that QA works 
better than SA in general. We do not actually think 
QA intuitively helps clustering, but we apply QA to 
clustering just as procedure to derive an algorithm. 
A derived QA algorithm depends on the definition of 
quantum effect TYq. We propose quantum effect TYq, 



which leads to a search strategy fit to clustering. Our 
contribution is, 

1. to propose a QA-based optimization algorithm for 
clustering, in particular 

(a) quantum effect Hq for clustering 

(b) and a good annealing schedule, which is cru- 
cial for applications, 

2. and to experimentally show the proposed al- 
gorithm optimizes clustering assignments better 
than SA. 

We also show the proposed algorithm is as easy as SA 
to implement. 

The algorithm we propose is a Markov chain Monte 
Carlo (MCMC) sampler, which we cah QA-ST sam- 
pler. As we explain later, a naive QA sampler is in- 
tractable even with MCMC. Thus, we approximate 
QA by the Suzuki- Trotter (ST) expansion (Trotter, 
1959; Suzuki, 1976) to derive a tractable sampler, 
which is the QA-ST sampler. QA-ST looks like parallel 
m SAs with interaction / (see Fig. 2). At the begin- 
ning of the annealing process, QA-ST is almost the 
same as m SAs. Hence, QA-ST finds m (local) optima 
independently. As the annealing process continues, in- 
teraction / in Fig. 2 becomes stronger to move m states 
closer. QA-ST at the end picks up the state with the 
lowest energy in m states as the final solution. 

QA-ST with the proposed quantum effect TYq works 
well for clustering. Fig. 3 is an example where data 
points are grouped into four clusters, ai and (72 are 
locally optimal and a* is globally optimal. Suppose m 
is equal to two and ai and <J2 in Fig. 2 correspond to ai 
and (72 in Fig. 3. Although ai and (J2 are local optima, 
the interaction / in Fig. 2 allows ai and (72 to search 
for a better clustering assignment between ai and (72. 
Quantum effect TYq defines the distance metric of clus- 
tering assignments. In this case, the proposed TYq lo- 
cates (7* between ai and (72. Thus, the interaction / 
gives good chance to go to cr* because / makes ai and 
(72 closer (see Fig. 2). The proposed algorithm actually 




Figure 1: Quantum annealing (QA) 
adds another dimension to simulated 
annealing (SA) to control a model. 
QA iteratively decreases T and T 
whereas SA decreases just T. 
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Figure 2: Illustrative explanation of QA. The left figure 
shows m independent SAs, and the right one is QA algo- 
rithm derived with the Suzuki- Trotter (ST) expansion, cj 
denotes a clustering assignment. 



finds a* from cji and (72. Fig. 3 is just an example. 
However, a similar situation often occurs in cluster- 
ing. Clustering algorithms in most cases give "almost" 
globally optimal solutions like ai and (72, where the 
majority of data points are well-clustered, but some 
of them are not. Thus, a better clustering assignment 
can be constructed by picking up well-clustered data 
points from many sub-optimal clustering assignments. 
Note an assignment constructed in such a way is lo- 
cated between the sub-optimal ones by the proposed 
quantum effect TYq so that QA-ST can find a better 
assignment between sub-optimal ones. 

2 Preliminaries 

First of all, we introduce the notation used in this 
paper. We assume we have n data points, and they 
are assigned to k clusters. The assignment of the i- 
th data point is denoted by binary indicator vector 
di. For example, when k is equal to two, we denote 
the z-th data point assigned to the first and the sec- 
ond cluster by di = (1, 0)-^ and = (0, 1)-^, re- 
spectively. The assignment of all data points is also 
denoted by an indicator vector, cr, whose length is k'^ 
because the number of available assignments is k^. a 
is constructed with {ai}2^i, a = (2)iLi where (g) is 
the Kronecker product, which is a special case of the 
tensor product for matrices. Let A and B be matrices 

'^^^ M. Then, A 5= f^^^^ ^^^^ 

^^21 0.22 J \Ci2lB a22B 

(see Minka (2000) for example). Only one element 
in (7 is one, and the others are zero. For example, 
(7 = <Ji(g)a2 = (0, 1, 0, 0)^ when /c = 2, n = 2, the first 
data point is assigned to the first cluster (<Ji = (1, 0)^) 
and the second data point is assigned to the second 
cluster (a2 = (0, 1)"^). We also use k by n matrix Y 
to denote the assignment of all data. 



where A = 



X cluster 1; O cluster 2; + clusters; cluster 4; 

(7i (local optimum) (72 (local optimum) 




(7* (global optimum) 





assignment space defined 
by quantum effect 



Figure 3: Three clustering results by a mixture of four 
Gaussians (i.e. #clusters=4). 

We do not store a in memory whose length is /c^, but 
we store Y. We use a only for the derivation of quan- 
tum annealing. The proposed QA algorithm is like 
parallel m SAs. We denote the j-th SA of the parallel 
SA by (7j. The i-th data point in aj is denoted by 
s.t. (7j = (8)r=i^j\*- When A is a matrix, is the 
matrix exponential of A defined by = X] 



1 Al 

=0 IT ^ 



Y{a) = (cri,cr2, ...,an). 



(1) 



3 Simulated Annealing for Clustering 

We briefiy review simulated annealing (SA) (Kirk- 
patrick et al., 1983) particularly for clustering. SA 
is a stochastic optimization algorithm. An objective 
function is given as an energy function such that a 
better solution has a lower energy. In each step, SA 
searches for the next random solution near the current 



one. The next solution is chosen with a probabihty 
that depends on temperature T and on the energy 
function value of the next solution. SA almost ran- 
domly choose the next solution when T is high, and 
it goes down the hill of the energy function when T is 
low. Slower cooling T increases the probability to find 
the global optimum. 

Algorithm 1 summarizes a SA algorithm for clustering. 
Given inverse temperature (3 = 1/T, SA updates state 
a with, 

l>SA(^;/5) = |exp[-/3E(a)], (2) 

where E{(j) is the energy function of state a, 
and Z is a normalization factor defined by 
Z = 5]]^ exp(— /3£^(cr)). For probabilistic mod- 
els, the energy function is defined by E{a) = 

- logPprob-model(^, Cr) whcrC Pprob-model(^, Cr) IS givCU 

by a probabilistic model and X is data. Note 
PsA{cr;f3 = 1) = Pprob-modei(^l^). For loss-fuuctiou- 
based models (e.g. spectral clustering), which searches 
for a = argmin^ loss{X, a), the energy function is de- 
fined by E{a) = loss{X, a). 

In many cases, the calculation of Z in (2) is intractable. 
Thus, Markov chain Monte Carlo (MCMC) is utilized 
to sample a new state from a current state. In this pa- 
per, we focus on the Gibbs sampler in MCMC meth- 
ods. Each step of the Gibbs sampler draws an assign- 
ment of the i-th data point, a^, from. 



exp i-m^)] 

E^, expM^(a)]' 



(3) 



where means {^j\j i}. Note the denominator 
of (3) is summation over cr^, which is tractable {0{k)). 

4 Quantum Annealing for Clustering 

Our goal of this section is to derive a sampling al- 
gorithm based on quantum annealing (QA) for clus- 
tering. On the way to the goal, our contribution is 
three folds, which are a well- formed quantum effect 
in Section 4.1, an appropriate similarity measure for 
clustering in Section 4.2 and an annealing schedule in 
Section 4.3. 

4.1 Introducing Quantum Effect and the 
Suzuki- Trotter expansion 

Before we apply QA to clustering problems, we set 
them up in a similar fashion to statistical mechanics. 
We reformulate (2) by the following equation. 



where He is a by diagonal matrix. He is called 
(classical) Hamiltonian in physics. For example, we 
have the following He when k = 2 and n = 2. 



/E{a(^^) \ 
^((t(2)) 
S((t(3)) 
V E{a''^y)) 



(5) 



In this example, cr^*^ indicates the t-th assignment of 
k'^ available assignments, i,e. cr^^^ = (1? 0? 0? 0)^^ 
= (0, 1, 0, 0)^, = (0, 0, 1, 0)^ and 

(7(4) = (^0, 0, 0, 1)-^. e~^^^ is the matrix exponen- 
tial in (4). Since He is diagonal, e~^^^ is also di- 
agonal with [e~^'^^]tt = exp(— /?E(cr*^^^)). Hence, we 
find a^t)T^-mc^{t) _ exp(-/3E(cr^*))) and (4) equal 
to (2). In practice, we use MCMC methods to sample 
(T from psa(c";/^) in (4) by (3). This is because we 
do not need to calculate Z and it is easy to evaluate 
^T^-PHcfj^ which is equal to exp{—f3E{a)). 

QA draws a sample from the following equation. 



PQA(a;/3,r) = ±<T^e-''V 



where H is defined by H = He 



(6) 

TYq. TYq represents 



quantum eff'ect. We define TYq by Hq = X^ILi where 
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Efe is the A; by A; identity matrix, and Ifc is the k by 
k matrix of ones, i.e. [lk\ij = 1 for all i and j. For 
example, H is, 



/e((t(i)) -r -r \ 
-r £'((t(2)) -r 
-r E{a(3)) -r 
V -r -r E(a<-^y) J 



(7) 



(4) 



when k = 2 and n = 2. The derived algorithm depends 
on quantum effect Hq. We found our definition of Hq 
worked well. We also tried a couple of Hq. We explain 
a bad example of Hq later in this section. 

QA samples a from (6). For SA, MCMC methods are 
exploited for sampling. However, in quantum mod- 
els, we cannot apply MCMC methods directly to (6) 
because it is intractable to evaluate unlike 
^T^-fdHc^ = exp(-/5E(cr)). This is because e~^'^ is 
not diagonal whereas e~^^^ is diagonal. Thus, we ex- 
ploit the Trotter product formula (Trotter, 1959) to 
approximate (6). If Ai, are symmetric matrices, 

the Trotter product formula gives, exp ^J^/^i = 

[Yli=i exp(Az/m) j + O (^). Note the residual of fi- 
nite m is the order of 1/m. Hence, this approxima- 
tion becomes exact in the limit of m oo. Since 



H = Hc-\- Hq is symmetric, we can apply the Trotter 
product formula to (6). Following (Suzuki, 1976), (6) 
reads the following expression. 

Theorem 4.1. 

Pqa{(ti;(3,T) 

= Y."^PQA-st{ctuct2. cTm; A T) + O (^^^ , (8) 



cr2 



a2 (7. 

where 



PQA-sricTi, ...,cr^;/3,r) 

-l m 



i=i 



0/(/?.r)^ (9) 
s{aj,aj^i) = (10) 

i=l 

/(/3,r)=nlog(^l+ ^J'_ J . (11) 

The derivation from (6) to (8) is called the Suzuki- 
Trotter expansion. We show the details of the deriva- 
tion in Appendix A. (8) means sampling cji from 
Pqa{(^i] r) is approximated by sampling (cri, cr^) 
frompQA-sricri, ....(Tm)- (9) shows pqa-st is similar to 
parallel {psA(crj; P/m)}JLi^ but it has quantum inter- 
action e^(^^'^^+i)/(/5,r)^ j^Q^g -f f(^p^Y) = 0, i.e. T = 
oo, the interaction disappears, and pqa-st becomes 
m independent SAs. s{aj^aj-^i) takes [0,1] where 
5(crj,crj+i) = 1 when aj = CTj+i and s{aj^aj-^i) = 
when <jj and cTj+i are completely different. Thus, we 
call s(<jj,<jj+i) similarity. Even with finite m, we can 
show the approximation in (8) becomes exact after 
enough annealing iterations has passed with our an- 
nealing schedule proposed in Section 4.3^. 

The similarity in (10) depends on quantum effect TYq. 
A different TYq results in a different similarity. For ex- 
ample, we can derive an algorithm with quantum effect 
r(E/er. - Ifc-). gives similarity 5' ((7^,(7^+1 ) = 
Yli=i^{^j,i^^j-\-i,i)- Going back to Fig. 3, we notice 
s{o'i,a2) > but s'(cri,<j2) = 0. In this case, pqa-st 
with is just m independent SAs because interaction 
/ is canceled by s'(cri, (J2) = 0, and pqa-st is unlikely 
to search for a*. On the other hand, Pqa-st with TYq 
is more likely to search for <j* because interaction / 
allows (Ji and a2 to go between ai and (T2. 

Now, we can construct a Gibbs sampler based on 
Pqa-st in a similar fashion to (3). Although the 
sampler is tractable for statistical mechanics, it is in- 
tractable for machine learning. We give a solution to 



^The residual of the approximation in (8) is dominated 
by /3^r with small (3 and large F. Using the annealing 
schedule proposed in Section 4.3, the residual goes to zero 
as annealing continues (/3 ^ and F 00). 




Figure 4: ai and (72 give the same clustering but have 
different cluster labels. Thus, s ((11,(72) = 0. After 
cluster label permutation from (72 to (73, 5((7i, (73) = 1. 
The purity^ s, gives 5((7i, (72) = 1 as well. 




sampling iteration 

Figure 5: The schedules of P and /(/?,F). 

the problem in Section 4.2. We also discuss the an- 
nealing schedule of P and F in Section 4.3, which is a 
crucial point in practice. 

4.2 Cluster-Label Permutation 

Our goal is to make an efficient sampling algo- 
rithm. In a similar fashion to (3), we can con- 
struct a Gibbs sampler pQA-ST(^j,i|{c"}j=i\^j,i) whose 
computational complexity is the same as that of 
(3). However, the sampler can easily get stuck in 
local optima, which is for example PQA-ST(cri, cr2) 
in Fig. 4. If we can draw (73 in Fig. 4 from (72, 
Pqa-st {cTi , (72 ) is a better state than pqa-st (c^i , 0^2 ) i.e. 

PQA-ST(cri,^2) > PQA-ST(cri , Cr2) because 5 ((71,(73) = 

1, 5((7i,(72) = and /(/?, F) > in (9). Since sam- 
pler pQA-ST{o'j,i\{o-}JLi\d-j^i) only changes the label of 
one data point at a time, the sampler cannot sam- 
ple (72 from (72 efficiently. In statistical mechanics, a 
cluster label permutation sampler is applied to cases 
such as Fig. 4. The label permutation sampler does 
not change cluster assignments but draws cluster label 
permutation, e.g. (73 from (72 in one step. In other 
words, the sampler exchanges rows of matrix Y{a) de- 
fined in (1). In the case of Fig. 4, k is equal to four, 
so we have 4! = 24 choices of label permutation. The 
computational complexity of the sampler is 0{k\) be- 
cause its normalization factor requires summation over 
k\ choices. The sampler is tractable for statistical me- 
chanics due to relatively small k. However, it is in- 
tractable for machine learning where k can be very 
large. 

We introduce approximation of pqa-st so that we do 
not need to sample label permutation, whose com- 



Algorithm 1 Simulated Annealing for Clustering 
1: Initialize inverse temperature jS and assignment a. 
2: repeat 

3: for i = 1, n do 

4: Draw the new assignment of the i-th data 

point, di, with a probability given in (3). 
5: end for 

6: Increase inverse temperature (3. 
7: until State a converges 



putational complexity is 0{k\). In particular, we 
replace similarity s{(jj^(jj-^i) in (9) by the purity, 
5(<jj, cTj+i). The purity, ^(crj, <jj+i), is defined by 

where Y is defined in (1), and [A]c,c' denotes the 
(c, c') element of matrix A. In the case of Fig. 4, 
s{cFi,a2) = 1 whereas s{ai,a2) = 0. In general, 
s(cri,cr2) < s(cri,cr2). 

Let aj^i be the i-th data point of assignment aj. The 
update probability of aj^i with the purity is. 



PQA-ST+purity (c^ j,i | {cF j }^i \^ j,i ; T) 



exp 






J2 exp 


-iE{aj) + s>,_i, a,-, )/(/?, r) 



where s(crj_i, dj, cr^+i) = s{aj,aj-i) + ^(crj, aj+i)^. 
The computational complexity of (12) is 0{k'^), but 
caching statistics reduces it to 0{k). Thus, Step 1 in 
Algorithm 2 requires 0{k'^), and Step 5 requires 0{k), 
which is the same as SA. 

Using another representation of aj and a different Wq, 
we can develop a sampler, which does not need label 
permutation. However, its computational complexity 
is 0{n) in Step 5, which is much more expensive than 
0{k). Thus, the sampler is less efficient than the pro- 
posed sampler even though the sampler does not need 
to solve label permutation. 

4.3 Annealing Schedule of (3 and F 

The annealing schedules of /3 and F significantly af- 
fect the result of QA. Thus, it is crucial to use good 
schedules of (3 and F. In this section, we propose the 
annealing schedule of F and p. 

We address two points before proposing a schedule. 
One is our observation from pilot experiments, and 
the other is the balance of f3 and F. From our pilot 

^Note s is not commutative, and take care of the or- 
der of the arguments of s. We use s(crj_i, ctj, ctj+i) = 
s{aj,aj-i) + s(crj, CTj+i), but we omit the reason due to 
space. 



Algorithm 2 Quantum Annealing for Clustering 
1: Initialize inverse temperature (3 and quantum an- 
nealing parameter F. 
2: repeat 

3: for j = 1, m do 
4: for i = 1, n do 

5: Draw the new assignment of the i-th 

data point, (jj^i, with a probability 
given in (12). 

6: end for 

7: end for 

8: Increase inverse temperature /3, and decrease 

QA parameter F. 
9: until State a converges 



experiments, we observe QA-ST works well when it 
can find suboptimal assignments {crj}^i by conver- 
gence. (12) shows QA-ST searches for a better assign- 
ment from suboptimal {o'j}JLi- On the other hand, 
when current {ctjI^i are far away from global opti- 
mum or even sub-optima, QA-ST does not necessarily 
work well. Comparing (12) with (3) in terms of P and 
F, if ^ > /(/5,F), {(Tj}f=i are sampled from psA(cri), 
i.e. no interaction between aj and (Jj-\-i. On the other 
hand, if ^ ^ /(/5,F), {o'j}JLi become very close to 
each other regardless of energy E{(jj). 

From the above discussion, 13/m at the beginning 
should be larger than /(/?, F) and large enough to col- 
lect suboptimal assignments, and /(/3,F) should be- 
come larger than at some point to make {ctj}^]^ 
closer. The best path of [3 and f{[3,T) would be like 
/* in Fig. 5. /i in Fig. 5 is stronger than j3 from the 
beginning, which does not allow QA-ST to search for 
good assignments due to too strong quantum inter- 
action /(/5, F)5((7j_i, (jj, cTj+i) in (12). /2 is always 
smaller than jS. Hence, QA-ST never makes {cr^}^^ 
closer. In other words, QA-ST does not search for 
a middle (hopefully better) assignment from {crj}^i- 
Specifically, we use the following annealing schedule in 
Step 8 in Algorithm 2. 

/5 = /5o4, F = Foexp(-4), (13) 

where r/3 and rr are constants and i denotes the i- 
th iteration of sampling. (13) comes from the follow- 
ing analysis of /(/?,F). When ^ <C 1, (11) reads, 

/(AF) ^ -nlog(^) = -nlog(^) . Thus, 

the path of /(/5,F) become /* in Fig. 5 when Fq is 
large enough and r/3 < rp. In this paper^, we set Fq 
to a large value such that /(/5,F) ?^ until (3 = m. 
This means PQA-ST(cri, •••,crm) is just m independent 

^When QA-ST is applied to loss- function-based mod- 
els, "until (3 = m" should be calibrated according to loss- 
functions. 



instances of psA{crj't P/m) until f3 = m. 

Note there is not much difference of difficulty between 
SA and QA-ST to choose annealing schedules. In gen- 
eral, we should choose the schedule of F to be /* when 
the schedule of P is given. As shown in the next sec- 
tion, QA-ST works weh with rr ~ x 1.05. Thus, the 
difficulty of choosing annealing schedules for QA-ST is 
reduced to that of choosing the schedule of (3 for SA. 

5 Experiments 

We show experimental results in Fig. 6. In the top 
three rows of Fig. 6, we vary the schedule of F with a 
fixed schedule of (3 to see QA-ST work better than the 
best energy of m SAs when the schedule of F lets the 
path of f{P,T) be /* in Fig. 5. In the bottom row of 
the figure, we compare QA-ST and SA with a slower 
schedule of /3. This experiment shows whether QA- 
ST still works better than SA or not while the slower 
schedule of P improves SA. 

We apply SA and QA-ST to two models, which are a 
mixture of Gaussians (MoG) with a conjugate normal- 
inverse- Wishart prior and the latent Dirichlet allo- 
cation (LDA) (Blei et al., 2003). For both mod- 
els, parameters are marginalized out, and E{a) = 
- logp(X, a) where X is data. Thus, QA-ST and SA 
search for maximum a posteriori (MAP) assignment 
cr. MoG is applied to MNIST data set, and LDA is 
applied to NIPS corpus and Reuters. For MNIST, we 
randomly choose 5,000 data points and apply PC A 
to reduce the dimensionality to 20. NIPS corpus has 
1,684 documents, and we randomly choose 1000 words 
in vocabulary. We also randomly choose 1000 doc- 
uments and 2000 words in vocabulary from Reuters. 
We set k to 30, 20 and 20 for MNIST, NIPS corpus 
and Reuters, respectively. We use the same schedule of 
P for QA-ST and SA. In particular, we use the same 
rf3 for QA-ST and SA, and we set po = 0.2 for SA 
and po = 0.2m for QA-ST. The difference of po for 
SA and QA-ST keeps QA-ST similar to SA in terms 
of /^-annealing for fair comparison (see (3) and (12)). 
For each data, we vary rr from 1.02 to 1.20 with fixed 
rp. When rp < rr, the path of /(/?,F) becomes /* in 
Fig. 5. For any data set, m is set to 50 for QA-ST. We 
set m of SAs so that m SAs consume the same time as 
QA-ST. Thus, we can compare QA-ST and m SAs in 
terms of iteration in Fig. 6. Consequently, m of SA was 
set to 51, 55 and 55 for MNIST, NIPS and Reuters, 
respectively^. In Fig. 6, we plot only after P = m for 
QA-ST and /5 = 1 for SA, which happen at the same 
iteration for QA-ST and SA. 

"^QA-ST and m SAs took 21.7 and 22.0 hours for 
MNIST, 62.5 and 62.8 hours for NIPS and 9.9 and 10.0 
hours for Reuters. 



In Fig. 6, the left column and the middle column show 
the minimum and the mean energy of {crj}JLi- Since 
this is an optimization problem, we are interested in 
the minimum energy in the left column. For each data, 
QA-ST with /* achieved better results than SA. The 
right column of Fig. 6 shows the mean of purity s. As 
we expect, the larger rr resulted in the higher s. The 
bottom row of Fig. 6 shows the result of NIPS with the 
slower schedule of P than the schedule in the third row 
of Fig. 6. Although SA found better results than the 
third row of Fig.6, QA-ST stih worked better than SA. 
Our experimental results are also consistent with the 
claim of Matsuda et al. (2009), which is that QA works 
the better than SA for more difficult problems. QA 
worked better for LDA than MoG. The right column 
of Fig.6 shows s of LDA converged to smaller values 
than that of MoG. This means LDA has more local 
optima than MoG. 

In this section, we have shown QA-ST achieves better 
results than SA when the schedule of F is /* in Fig. 5. 
We have also shown even with the slower schedule of 
p, QA-ST stiU works better than SA. 

6 Discussion &; Conclusion 

Many techniques to accelerate sampling have been 
studied. Such techniques can be applied to the pro- 
posed algorithm. For example, the split-merge sampler 
(Richardson and Green, 1997) and the permutation 
augmented sampler (Liang et al., 2007) use a global 
move to escape from local minima. These techniques 
are available for the proposed algorithm as well. We 
can also apply the exchange Monte Carlo method. 

We have applied quantum annealing (QA) to cluster- 
ing. To our best knowledge, this is the ffist study of 
QA for clustering. We have proposed quantum effect 
TYq fit to clustering and derived a QA-based sampling 
algorithm. We have also proposed a good annealing 
schedule for QA, which is crucial for applications. The 
computational complexity of QA is larger than a single 
simulated annealing (SA). However, we have empiri- 
cally shown QA finds a better clustering assignment 
than the best one of multiple-run SAs that are ran- 
domly restarted until they consumes the same time as 
QA. In other words, QA is better than SA when we run 
SA many times. Actually, it is typical to run SA many 
times because SA's fast cooling schedule of tempera- 
ture T does not necessarily find the global optimum. 
Thus, we strongly believe QA is a novel alternative to 
SA for optimizing clustering. In addition, it is easy to 
implement the proposed algorithm because it is very 
similar to SA. 

Unfortunately, there is no proof yet that QA is bet- 
ter than SA in general. Thus, we need to experimen- 
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Figure 6: Comparison between SA and QA varying anneahng schedule, rr, /2 and /* in legends correspond to 
Fig. 5. The left-most column shows what SA and QA found. QA with /* always found better results than SA. 



tally show QA's performance for each problem like this 
paper. However, it is worth trying to develop QA- 
based algorithms for different models, e.g. Bayesian 
networks, by different quantum effect 7iq. The pro- 
posed algorithm looks like genetic algorithms in terms 
of running multiple instances. Studying their relation- 
ship is also interesting future work. 
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A The Details of the Suzuki- Trotter 
Expansion 

Following Suzuki (1976), we give the details of deriva- 
tion of Theorem 4.1. The following Trotter product 
formula (Trotter, 1959) says if ^i,--- , An are sym- 
metric matrices, we have 

E ^' j = f n exp(^z/m) j + o(J^y (14) 

Applying the Trotter product formula to (6), we have 

Note 

= ^afe^a2aJeV. (16) 

0-2 

Hence, we express (15) by marginalizing out auxiliary 
variables {cr^, cr2, (72, cr^, cr^}, 

^crf (^e-^^^e-^^^^^^ ai (17) 



a[ (72 am (t'^ 



(18) 



^ m 



(19) 



where cr^i+i = cri- To simplify (19) more, we use the 
following Lemma A.l and Lemma A. 2. 

Lemma A.l. 



aje--^^^a'j =exp ( -^E{aj) ] S{aj,a'j) 



o^PsA {cTj ; p/m)S{aj , a • ) , (20) 



Proof. By the definition, e is diagonal with 

[e~m'^cj^^ = E{a^^^), and Gj and a'- are binary in- 
dicator vectors, i.e. only one element in <jj is one and 
the others are zero. Thus, the above lemma holds. □ 

Lemma A. 2. 

e-^^^a,-+i oc e^(-;'-.+i)/(/^'n. (21) 

Proof. Substituting {A ^ B){C ^ D) = {AC) {BD) 
and = e^^e^^ when A1A2 = A2A1, we find. 



erf e--^^cr,-+l =crf | 6^6"-^^ I a. 



n 

= I[^Z^~-'^J+i,i, (22) 

where aj^i is the i-th element of Kronecker product of 
(7j, s.t. (7j = (2)ILi^j,*- Substituting the following 
(23) and (24) into (22),' 



/=0 



(23) 



=r'(E,-i,)'=r'|^(^ ^ )eu-i.)'-^| 
=r'|Efe + E( • )fc'-'-^(-i)'-'ife| 
-r'{E. + ^g(;)(-fc)-i.) 

=r'|Efc + i{(l-fc)'-l}lfc|, (24) 



we have 



i=i 1=0 ' ^ ^ ^ ^ 

i=l 1=0 ^ ^ ^ ^ 
^^s(a;,a,+0/(/5,r)_ (25) 



□ 



Using Lemma A.l and Lemma A. 2 into (19), (17) be- 
comes, 



where 6{aj,a'.) = 1 if aj = a'- and S{aj,a'.) = oth- = 7 E E n^SA(cr^; /^M)e 



CTm j = l 



From (15) and the above expression, we have shown 
Theorem 4.1. 



